Class-based Prediction Errors to Categorize Text with Out-of-vocabulary Words

Authors

  • Joan Serrà
  • Ilias Leontiadis
  • Dimitris Spathis
  • Gianluca Stringhini
  • Jeremy Blackburn
Abstract

Common approaches to text categorization essentially rely either on n-gram counts or on word embeddings. This presents important difficulties in highly dynamic or quickly-interacting environments, where the appearance of new words and/or varied misspellings is the norm. To better deal with these issues, we propose to use the error signal of class-based language models as input to text classification algorithms. In particular, we train a next-character prediction model for any given class, and then exploit the error of such class-based models to inform a neural network classifier. This way, we shift from the ‘ability to describe’ seen documents to the ‘ability to predict’ unseen content. Preliminary studies using out-of-vocabulary splits from abusive tweet data show promising results, outperforming competitive text categorization strategies by 4–11%.
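
The idea in the abstract (one next-character predictor per class, with each model's prediction error used as a classification feature) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes an add-one smoothed character bigram model as a stand-in for the next-character prediction model, and it simply picks the class with the lowest error where the abstract describes feeding the errors to a neural network classifier. The names `CharBigramModel`, `train_class_models`, `error_features`, and `predict`, the `alphabet_size` smoothing constant, and the toy "spam"/"ham" data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class CharBigramModel:
    """Add-one smoothed character bigram model.

    Assumption: a simple stand-in for a learned next-character predictor.
    """

    def __init__(self, alphabet_size=256):
        # alphabet_size is an assumed smoothing constant, not a value from the paper.
        self.alphabet_size = alphabet_size
        self.pair_counts = defaultdict(Counter)  # previous char -> next-char counts
        self.context_counts = Counter()          # previous char -> total count

    def fit(self, texts):
        for text in texts:
            padded = "^" + text  # "^" marks the start of a text
            for prev, nxt in zip(padded, padded[1:]):
                self.pair_counts[prev][nxt] += 1
                self.context_counts[prev] += 1
        return self

    def error(self, text):
        """Mean negative log-probability per predicted character."""
        padded = "^" + text
        pairs = list(zip(padded, padded[1:]))
        if not pairs:
            return 0.0
        total = 0.0
        for prev, nxt in pairs:
            seen = self.pair_counts.get(prev, {}).get(nxt, 0)
            prob = (seen + 1) / (self.context_counts[prev] + self.alphabet_size)
            total -= math.log(prob)
        return total / len(pairs)


def train_class_models(labelled_texts):
    """Fit one character model per class from (text, label) pairs."""
    by_class = defaultdict(list)
    for text, label in labelled_texts:
        by_class[label].append(text)
    return {label: CharBigramModel().fit(texts) for label, texts in by_class.items()}


def error_features(models, text):
    """Per-class prediction errors, ordered by sorted class label."""
    return [models[label].error(text) for label in sorted(models)]


def predict(models, text):
    """Stand-in classifier: choose the class whose model predicts the text best."""
    labels = sorted(models)
    feats = error_features(models, text)
    return labels[feats.index(min(feats))]


# Toy, invented training data (not from the paper).
train = [
    ("win cash now", "spam"),
    ("free cash prize", "spam"),
    ("see you at lunch", "ham"),
    ("lunch at noon", "ham"),
]
models = train_class_models(train)

# Misspelled words that never appear verbatim in the training texts.
for sample in ["freee cashh", "lunchh at noonn"]:
    print(sample, error_features(models, sample), predict(models, sample))
```

Because the features are character-level prediction errors rather than word counts or word embeddings, a misspelled or unseen word still yields a usable signal. In a fuller setup, the vector returned by `error_features` would be passed to a trained classifier instead of the lowest-error rule used here.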

Similar resources

Class-based Prediction Errors to Detect Hate Speech with Out-of-vocabulary Words

Common approaches to text categorization essentially rely either on n-gram counts or on word embeddings. This presents important difficulties in highly dynamic or quickly-interacting environments, where the appearance of new words and/or varied misspellings is the norm. A paradigmatic example of this situation is abusive online behavior, with social networks and media platforms struggling to ef...

Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting

Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...

Class-Based N-Gram Language Model for New Words Using Out-of-Vocabulary to In-Vocabulary Similarity

Out-of-vocabulary (OOV) words create serious problems for automatic speech recognition (ASR) systems. Not only are they misrecognized as in-vocabulary (IV) words with similar phonetics, but the error also causes further errors in nearby words. Language models (LMs) for most open vocabulary ASR systems treat OOV words as a single entity, ignoring the linguistic information. In this paper we pre...

L2 Vocabulary Learning and the Use of Reading Tasks: Manipulating the Involvement Load Index

As Schmidt (2008) states, deeper engagement with new vocabulary as induced by tasks clearly increases the chances of learning those words. This engagement is theoretically clarified by the involvement load hypothesis (ILH, Laufer and Hulstijn, 2001), based on which the involvement index of each task can be measured. The present study was designed to test ILH by evaluating the impact of 4 differ...

Publication date: 2017